Abstract

Bibliometric indexes are customarily used to evaluate the impact of scientific
research, even though it is well known that they may range over very different
intervals in different research areas. Sometimes this is evident even within a
single field of investigation, which makes the assessment of scientific papers
difficult and inaccurate. On the other hand, the problem can be recast in the
same framework that has made it possible to cope efficiently with the ordering
of web pages, i.e., the one used to formulate the Google PageRank. For this
reason, we call it the PaperRank problem, and we solve it here with an approach
similar to that employed by PageRank. The solution obtained, which is
mathematically grounded, is used to compare the usual heuristics based on the
number of citations with a new one proposed here. Some numerical tests show
that the new heuristics is much more reliable than the currently used ones,
based on the bare number of citations. Moreover, we show that our model
improves on recently proposed ones [3].

Key words: Bibliometric indexes, PageRank, citations, H-index, normalized 
citations. 



1 Introduction 

In recent years, it has become fashionable to evaluate the impact of research
by using bibliometric indexes. This approach clearly does not solve the problem
at hand, though it can be useful for getting a rough idea about specific issues.
For instance, the quality of a scientist is sometimes ranked by using the
so-called H-index [IT], though it is well known that this can be useful only
for analysing research with short-term returns, whereas it could be completely
inadequate for assessing more basic research fields which, in turn, prove to be
important (and, sometimes, priceless) only after decades or centuries. As an
example, Fermat's little theorem, on which modern secure electronic
transactions essentially rely, dates back to 1640, when it was apparently
useless. Also, very important mathematicians are known to have a very small
H-index (e.g., Galois has an H-index equal to 2 ...).

Nevertheless, in some circumstances, bibliometric indexes make it possible to
obtain a rough idea about the impact of research, though their value heavily
depends on the chosen index. In particular, it is recognized that the bare
number of citations is a parameter with many drawbacks: it depends on the
specific field of research, on unfair behaviors (which, unfortunately, are not
unknown in the scientific setting), etc.

On the other hand, this problem is known to be structurally similar to that
of ranking URLs on the web. As is well known, this latter problem has been
formalized in the Google PageRank [5] (see also [16]). Based on the simple
idea that the importance of a web page depends on the number of web pages
that link to it and on their relative importance, PageRank relies on a solid
mathematical basis, which allows the search engine Google [TU] to efficiently
retrieve information across the web (see also [TTfT] for a deeper mathematical
analysis of the corresponding matrix problem, and [2,3] for generalizations).
Approaches based on this idea have been used for evaluating the impact of
scientific journals (see, e.g., [HH]) and the impact of scientific articles
(see, e.g., [B"jl5|ll4fl3]), also taking collaborations into account [5].
Sometimes, however, these procedures are not mathematically well refined. In
any case, such approaches use global information, which could be difficult to
recover and manage efficiently (it is enough to think of the computation of the
Google PageRank to realize the possible complexity of the problem). Indeed,
numerical algorithms need to be finely tuned in order to gain efficiency (see,
e.g., [HUE]). This is the main reason why heuristics like the number of
citations (which is relatively easy to compute) have become popular, even
though, as pointed out above, they may sometimes provide misleading advice.
Consequently, more efficient heuristics would be desirable for dealing with the
problem.

With this premise, in this paper we provide a model for constructing a
mathematically grounded ranking for what we call the PaperRank problem; under
some mild assumptions, this ranking is proved to exist and to be unique. The
results of this model are more reliable than those given by the model recently
proposed in [3], and will therefore be taken as "reference solutions" to
validate a new heuristics of local nature. The numerical examples provided here
then clearly show that, in many significant instances, the new heuristics is
much more reliable and fair than the usual one based on the bare number of
citations.



2 The Random Reader Model for the PaperRank Problem 



The principle that we describe here is analogous to that used in [3] for
deriving the famous PageRank of Google. For this reason, we name the problem
the PaperRank problem. We recall that the mathematically based
theory underlying the definition of the Google PageRank is the reason for its
effectiveness in retrieving information across the web. In the present setting,
its basic principle may then be reformulated as follows:

"an important paper is cited by important papers."

That is, in analogy with the random surfer model proposed for the PageRank
problem, we now have a virtual random reader, who starts reading a
paper and then randomly passes to a paper cited in it. If we repeat this
process indefinitely, the importance of a given paper is the fraction of time
that the random reader spends reading it (assuming, obviously, that each
paper is read in a constant time). This principle can be formally modeled by
introducing the following citation matrix (see footnote 2):

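$$L = (\ell_{ij}) \in \mathbb{R}^{N\times N}, \qquad \ell_{ij} = \begin{cases} 1, & \text{if paper } j \text{ cites paper } i,\\ 0, & \text{otherwise,} \end{cases} \qquad \forall i,j = 1,\dots,N. \qquad (1)$$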



2 Actually, by looking at the papers as the nodes of an oriented graph, such a
matrix is nothing but the transpose of the adjacency matrix.

Remark 1 We observe that, by introducing the unit vector

$$e = (1,\dots,1)^T \in \mathbb{R}^N,$$

the vector containing the number of citations of each paper is given by

$$c = (c_1,\dots,c_N)^T = L e \qquad (2)$$

(i.e., $c_i$ is the number of citations of the $i$th paper). This vector is currently
used for computing several bibliometric indexes such as, e.g., the H-index.
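As an illustration of (1) and (2), the following minimal Python sketch builds the citation matrix for a small, purely hypothetical set of four papers and counts their citations. The dictionary `references` and its contents are assumptions made only for this example.

```python
import numpy as np

# Hypothetical toy data: references[j] lists the papers cited by paper j.
references = {0: [1, 2], 1: [2], 2: [], 3: [0, 2]}
N = len(references)

# Citation matrix (1): L[i, j] = 1 if paper j cites paper i, 0 otherwise.
L = np.zeros((N, N))
for j, cited in references.items():
    for i in cited:
        L[i, j] = 1.0

e = np.ones(N)
c = L @ e          # equation (2): c[i] = number of citations of paper i
print(c)           # [1. 1. 3. 0.]
```

Here paper 2 is cited by papers 0, 1, and 3, so it collects three citations, while paper 3 is never cited.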

On the other hand, the vector

$$f^T = (f_1,\dots,f_N) = e^T L$$

contains the number of bibliographic items in each paper. That is, $f_j$ is the
number of references in paper $j$, $\forall j = 1,\dots,N$. If we then define $v_i$ as the
importance of the $i$th paper, then

$$v_i = \sum_{j=1}^{N} \ell_{ij} f_j^+ v_j, \quad i = 1,\dots,N, \qquad f_j^+ = \begin{cases} f_j^{-1}, & \text{if } f_j > 0,\\ 0, & \text{otherwise.} \end{cases}$$

By introducing the vectors

$$v = (v_1,\dots,v_N)^T, \qquad f^+ = (f_1^+,\dots,f_N^+)^T,$$

and the matrices

$$F = \mathrm{diag}(f), \qquad F^+ = \mathrm{diag}(f^+), \qquad (3)$$

the previous set of equations can be cast in vector form as

$$v = L F^+ v \equiv S v. \qquad (4)$$

However, this ranking may fail to exist or may not be unique, depending on
whether $1 \in \sigma(S)$ and on whether this eigenvalue is simple.
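A small numerical sketch may help to see why the ranking can fail to exist. It assumes a hypothetical four-paper citation matrix without cycles (no paper is reachable from itself), so that the scaled matrix is nilpotent and the only solution of the eigenvector equation is the zero vector. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical toy citation matrix: L[i, j] = 1 if paper j cites paper i.
L = np.array([[0, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 0, 0]], dtype=float)

f = L.sum(axis=0)                      # f = e^T L: references per paper
# f_plus[j] = 1/f[j] if f[j] > 0, and 0 otherwise
f_plus = np.divide(1.0, f, out=np.zeros_like(f), where=f > 0)
S = L * f_plus                         # S = L F^+ (column j scaled by f_plus[j])

eigenvalues = np.linalg.eigvals(S)
has_unit_eigenvalue = bool(np.any(np.isclose(eigenvalues, 1.0)))
print(f, has_unit_eigenvalue)
```

In this toy case paper 2 has no references, its column in the scaled matrix is zero, and 1 is not an eigenvalue, so no nontrivial ranking vector exists.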

In order to cope with this problem, in [3] the authors introduce, in a similar
model, a dummy paper, say 0, which references all the other ones and is
referenced by all of them (see footnote 3). That is, matrix $L$ is replaced by
the augmented matrix

$$\hat L = \begin{pmatrix} 0 & e^T \\ e & L \end{pmatrix} \in \mathbb{R}^{(N+1)\times(N+1)}. \qquad (5)$$

Matrix $S$ as in (4) is then defined accordingly:

$$\hat S = \hat L \hat F^+ \equiv \begin{pmatrix} 0 & e^T (F+I)^{-1} \\ N^{-1} e & L (F+I)^{-1} \end{pmatrix}, \qquad (6)$$

with $F$ the diagonal matrix defined in (3) and $I$ the identity matrix of
dimension $N$. Moreover, matrix (6) is clearly irreducible.

3 We here consider only the problem of ranking the papers, whereas in [3] a more
general problem is modeled.
However, this last feature makes the model not very faithful, in that it is
well known that there exist groups of papers whose citations do not overlap,
so that matrix $L$ is indeed reducible. This is often the case, for example, for
different fields of research within the same discipline or in different ones. For
this reason, we here propose a different solution to this problem, in which we
assume that, by default, each paper references itself, that is, $\ell_{ii} = 1$ for all
$i = 1,\dots,N$, so that (see (3))

$$f_j \ge 1, \qquad j = 1,\dots,N.$$

Consequently,

$$e^T S = e^T L F^+ = f^T F^{-1} = e^T,$$

so that $1 \in \sigma(S)$ and, evidently, the possible reducibility of the original matrix
(1) is retained by the modified one.
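The effect of the default self-reference can be checked numerically. The sketch below assumes a small random citation pattern (the size, the density 0.3, and the seed are arbitrary choices for illustration) and verifies that, once every paper references itself, the columns of the scaled matrix sum to one and 1 belongs to its spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
# Hypothetical random citations: L[i, j] = 1 if paper j cites paper i.
L = (rng.random((N, N)) < 0.3).astype(float)
np.fill_diagonal(L, 1.0)               # each paper references itself

f = L.sum(axis=0)                      # now f_j >= 1 for every paper
S = L / f                              # S = L F^{-1}: column j divided by f[j]
col_sums = S.sum(axis=0)               # e^T S, expected to equal e^T
has_unit_eigenvalue = bool(np.any(np.isclose(np.linalg.eigvals(S), 1.0)))
print(col_sums, has_unit_eigenvalue)
```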

Concerning the uniqueness of the ranking, by following steps similar to those
used for the Google PageRank, we may assume that, having reached a given
paper, the random reader chooses a paper cited in it with probability
$p \in (0,1)$, or jumps to read any paper at random with probability $1-p$. In
vector form this reads:

$$v = S(p)\, v \equiv \left( p S + \frac{1-p}{N}\, e e^T \right) v, \qquad p \in (0,1). \qquad (7)$$

Since

$$S(p) > 0, \qquad \|S(p)\|_1 = 1, \qquad \forall p \in (0,1),$$

from the Perron-Frobenius Theorem (see, e.g., [12]), one easily deduces that
$1 \in \sigma(S(p))$ is a simple eigenvalue, which strictly dominates in modulus
all other eigenvalues of $S(p)$. In addition, the corresponding eigenvector
satisfies $v > 0$. We then conclude that the PaperRank problem (7) admits a
solution, which is feasible (i.e., with positive entries) and unique. Moreover,
by choosing $p \approx 1$, $S(p) \approx S(1) \equiv S$ and, therefore, the
approximate model matches the original one well (see also Section 2.1 below).



Consequently, this ranking is rigorously mathematically grounded, though it
requires information of global nature, as is the case for computing the
Google PageRank. This means that it is relatively costly, since it requires
knowing all the data about every bibliographical item.
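One common way to approximate the positive eigenvector of such a model is the power method. The following is only a minimal sketch, not the authors' implementation: it assumes a dense NumPy citation matrix, the default self-reference described above, the jump term weighted by (1-p)/N, and the normalization that the entries of v sum to 1. The function name `paper_rank` and the parameters `tol` and `max_iter` are hypothetical.

```python
import numpy as np

def paper_rank(L, p=0.99, tol=1e-12, max_iter=100_000):
    """Power iteration for v = (p*S + (1-p)/N * e e^T) v with e^T v = 1."""
    L = np.array(L, dtype=float)       # copy, so the caller's matrix is untouched
    np.fill_diagonal(L, 1.0)           # default self-reference: l_ii = 1
    N = L.shape[0]
    S = L / L.sum(axis=0)              # S = L F^{-1}, column sums equal to 1
    v = np.full(N, 1.0 / N)            # start from the uniform vector
    for _ in range(max_iter):
        # (e e^T) v = e because the entries of v sum to 1
        v_next = p * (S @ v) + (1.0 - p) / N
        if np.abs(v_next - v).sum() < tol:
            return v_next
        v = v_next
    return v

# Hypothetical toy data: L_toy[i, j] = 1 if paper j cites paper i.
L_toy = np.array([[0, 0, 0, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [0, 0, 0, 0]], dtype=float)
v = paper_rank(L_toy)
print(v)
```

In this toy case paper 2 is cited by all the others and cites nobody else, so it receives by far the largest rank. Each iteration needs the whole matrix, which is the global information referred to above.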

Remark 2 We observe that this last step (i.e., the introduction of the
parameter $p$) is not required for the matrix $\hat S$ (see (6)) of the model
derived from (5), since one easily proves the following result.



Theorem 1 Let $L \ne O$ and $\hat S$ be defined according to (6). Then $\hat S^4 > 0$.

In other words, there always exists a path of exact length 4 between any two
of the nodes $0,\dots,N$, provided that $L \ne O$.
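The statement of the theorem can be checked on a small case. The sketch below assumes a hypothetical set of four papers with a single citation, builds the augmented matrix with the dummy paper in position 0, scales its columns to unit sum, and inspects the third and fourth powers.

```python
import numpy as np

N = 4
L = np.zeros((N, N))
L[1, 0] = 1.0                          # single citation: paper 0 cites paper 1

e = np.ones(N)
# Augmented matrix: the dummy paper cites, and is cited by, every other paper.
L_hat = np.block([[np.zeros((1, 1)), e[None, :]],
                  [e[:, None],       L]])
S_hat = L_hat / L_hat.sum(axis=0)      # every column scaled to unit sum

S3 = np.linalg.matrix_power(S_hat, 3)
S4 = np.linalg.matrix_power(S_hat, 4)
print(bool(np.all(S3 > 0)), bool(np.all(S4 > 0)))   # False True
```

In this toy case the third power still has zero entries, whereas the fourth power is entrywise positive.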
2.1 Perturbation analysis



In this section we provide a simple analysis showing how the introduction
of the parameter $p$ in (7) affects the original vector. For this purpose, let us
denote the eigenvector by $v(p)$. That is,

$$S(p)\, v(p) = v(p), \qquad p \in (0,1).$$

Clearly, $v^* = v(1)$ is the correct limit vector, which obviously exists, whereas
$v(0) = \frac{1}{N} e$. Consequently, an estimate for $v'(p)$ is given by

$$v' \approx v(1) - v(0) = v^* - \frac{1}{N} e.$$

One then obtains that, for $p \approx 1$:

$$v(p) \approx v^* + (p-1)\, v' = p\, v^* + \frac{1-p}{N} e. \qquad (8)$$

From (8) one then concludes that the introduction of the parameter $p$ results
in an almost uniform (small) perturbation of the entries of the correct vector.
As a matter of fact, the statistical properties of the two vectors are practically
the same for all the test problems reported in Section 3.
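As a complement to the first-order estimate, the following short derivation gives an exact expression for $v(p)$. It is a sketch under two assumptions that are not stated explicitly in the text: the jump term in the model is weighted by $(1-p)/N$, and the eigenvector is normalized so that $e^T v(p) = 1$.

```latex
\begin{align*}
v(p) &= S(p)\,v(p) = p\,S\,v(p) + \frac{1-p}{N}\,e\,\bigl(e^T v(p)\bigr)
      = p\,S\,v(p) + \frac{1-p}{N}\,e, \\
\Longrightarrow\quad (I - pS)\,v(p) &= \frac{1-p}{N}\,e, \\
\Longrightarrow\quad v(p) &= \frac{1-p}{N}\,(I - pS)^{-1} e
      = \frac{1-p}{N}\sum_{k\ge 0} p^k S^k e .
\end{align*}
```

The inverse exists for every $p \in (0,1)$ because $\|pS\|_1 = p < 1$, and setting $p = 0$ recovers the uniform vector $v(0) = \frac{1}{N} e$.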



2.2 A New Heuristics for the PaperRank Problem



As shown in the previous section, the correct PaperRank is obtained
by starting from the scaled matrix $L F^+$ in place of $L$. Similarly, instead of
considering the bare number of citations, given by the vector (2), we propose
to use normalized citations, defined as the entries of the vector (see (2))

$$c_{norm} = L F^+ e \equiv S e, \qquad (9)$$

which requires, like (2), only information of local nature. In other words, in
place of counting the number of citations to a given paper, we propose to sum
the citations to that paper, each divided by the number of references
in the paper containing the citation itself. It is obvious that the
index (9) has the same complexity as (2). Nevertheless, in the next section we
show that the statistical properties of (9) are fairer than those of (2), in the
sense that they better reproduce the correct ones provided by the reference
model (7).

It is evident from the definition (9) that the vector $c_{norm}$ is essentially the
first iterate of the power method applied to $S$, starting from a constant
vector. Consequently,

• it requires only local information, even though it aims to approximate,
in the limit, a global one (i.e., the PaperRank);

• no more than one iteration is possible without requiring global
information.

Therefore, the heuristics (9) is the best we can do by using only local
information. Nonetheless, as shown in the numerical tests, it proves to be
quite effective.
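The normalized citations can be computed in a couple of lines. The sketch below is illustrative only: the function name `normalized_citations` is hypothetical, the toy matrix is invented, and no default self-reference is added here, so papers without references simply contribute nothing.

```python
import numpy as np

def normalized_citations(L):
    """Return L F^+ e: each citation is weighted by 1 / (references of the citing paper)."""
    L = np.asarray(L, dtype=float)
    f = L.sum(axis=0)                                  # references per paper
    f_plus = np.divide(1.0, f, out=np.zeros_like(f), where=f > 0)
    return L @ f_plus

# Hypothetical toy data: L_toy[i, j] = 1 if paper j cites paper i.
L_toy = np.array([[0, 0, 0, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [0, 0, 0, 0]], dtype=float)
c_bare = L_toy.sum(axis=1)             # [1. 1. 3. 0.]
c_norm = normalized_citations(L_toy)   # [0.5 0.5 2.  0. ]
print(c_bare, c_norm)
```

Paper 2 receives three citations, but two of them come from papers with two references each, so its normalized score is 0.5 + 1 + 0.5 = 2.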

Remark 3 It is worth mentioning that the use of the normalized citations
(9) also copes correctly with the problem of self-citations. Indeed, for each
newly published paper, the additional normalized (self-)citations of an author
cannot exceed 1. On the contrary, they are virtually unbounded for the vector
(2) of bare citations.
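The bound in this remark follows from simple arithmetic, sketched below for one newly published paper. The function `self_citation_gain` and its arguments are hypothetical names introduced only for this illustration.

```python
def self_citation_gain(own_cited, others_cited):
    """Gain for an author from one new paper citing `own_cited` of their own
    earlier papers and `others_cited` papers by other authors."""
    total = own_cited + others_cited           # references in the new paper
    bare = own_cited                           # each self-citation counts 1
    normalized = own_cited / total if total > 0 else 0.0   # each counts 1/total
    return bare, normalized

print(self_citation_gain(10, 30))   # (10, 0.25)
print(self_citation_gain(50, 0))    # (50, 1.0)
```

Even a paper that cites only the author's own work adds at most 1 to the author's total normalized citations, whereas the bare count grows with the number of self-citations.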



3 Numerical Tests 

We here provide a few numerical tests, each one modeling a significant
situation, to compare the PaperRank obtained from (7) by choosing $p = 0.99$
(which we assume to be the reference one) with the rankings given by the model
(2), based on bare citations, and by (9), based on the normalized citations. For
each test we plot three histograms which rank the papers according to the
vectors representing these indexes. For ease of direct comparison, the obtained
vectors are normalized so that their values range in the interval [0, 1].

Moreover, for each problem we also compare the PaperRank obtained from (7)
with that obtained from (6), that is, from the model proposed in [3]. In fact,
we shall see that they may differ significantly, due to the fact that our model
preserves the possible reducibility of the matrix (1). On the other hand, it is
clear that the vectors (2) and (9) derived from the two matrices (1) and (5)
are essentially the same (obviously, by neglecting the first entry of the vector
in the second case).

Examples 1 and 2 

We consider a single homogeneous group of 500 articles (first
example) or two distinct homogeneous groups with 300 and 700 articles,
respectively (second example). In both cases each paper has a mean of 20
randomly distributed references in its own group (see the first two plots in
Figures 1 and 2). In both examples, all three rankings (7), (2), and (9)
have a similar distribution of the relevance of the papers, as shown in the
last three plots in Figures 1 and 2, even though (9) better fits the distribution
of (7). In Figure 2 the green bars concern the first group of papers, whereas
the blue ones concern the second group.

For these problems, the PaperRanks obtained from (7) and (6) turn out to be
similar to each other, so that we do not report the latter ones. Indeed, in the
first example the matrix is irreducible, and in the second example reducibility
is not an important feature, since the two blocks have similar properties (i.e.,
the same mean number of references in each paper).

Examples 3 and 4 

In these examples, we have two distinct homogeneous groups of papers,
with 300 and 700 items, respectively. Each paper only cites articles in its
own group. In Example 3, each paper in the first group has a mean of
10 randomly distributed references, whereas each paper in the second group
has a mean of 70 randomly distributed references (see the first two plots in
Figure 3). In Example 4, the situation is reversed, since the papers in the
smaller group have a mean of 70 randomly distributed references, whereas each
paper in the second group has a mean of 10 randomly distributed references
(see the first two plots in Figure 4). In the last three plots of Figures 3 and 4,
the green bars concern the first group of papers, whereas the blue ones concern
the second group. The rankings (7) and (9) always have a similar distribution
of the relevance of the papers, whereas the ranking (2) exhibits two peaks (one
for the first group and one for the second group), which swap places in the two
cases. It is clear that in these situations the different number of references in
the papers of each group invalidates the ranking (2), whereas it does not affect
the normalized ranking (9).
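The bias of the bare citation count in this setting can be reproduced with synthetic data. The sketch below is not the authors' test code: it assumes that each possible citation inside a group is drawn independently with probability (mean references)/(group size), which is only one way to obtain "randomly distributed references", and it uses the default self-reference. The function name `block_citations` and the seed are arbitrary.

```python
import numpy as np

def block_citations(sizes, mean_refs, rng):
    """Block-diagonal citation matrix: each paper cites only its own group."""
    N = sum(sizes)
    L = np.zeros((N, N))
    start = 0
    for n, m in zip(sizes, mean_refs):
        L[start:start + n, start:start + n] = rng.random((n, n)) < m / n
        start += n
    np.fill_diagonal(L, 1.0)           # default self-reference, so f_j >= 1
    return L

rng = np.random.default_rng(0)
L = block_citations([300, 700], [10, 70], rng)   # sizes and means of Example 3

f = L.sum(axis=0)                      # references per paper
c = L.sum(axis=1)                      # bare citations, c = L e
c_norm = L @ (1.0 / f)                 # normalized citations, L F^{-1} e
bare_means = (c[:300].mean(), c[300:].mean())
norm_means = (c_norm[:300].mean(), c_norm[300:].mean())
print(bare_means, norm_means)
```

Under these assumptions the mean bare citation count of a group equals its mean number of references (about 11 versus 71 here, self-references included), whereas the mean normalized citation count is exactly 1 in both groups, because each paper distributes a total weight of 1 inside its own group.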

As one may expect, for these problems there is a significant difference
between the PaperRanks obtained from our model (7) and those derived from
(6). Indeed, in both cases reducibility turns out to be an important feature
of the corresponding citation matrices. This is shown in Figures 5 and 6,
respectively, for the two examples. In each figure, the upper plot reproduces the
third subplot in Figures 3 and 4, respectively, whereas the lower plot depicts
the corresponding PaperRank derived from (6). It is evident that the latter
does not allow a proper comparison of the papers in the two groups.

Example 5 

In this example, we have two groups of papers: a larger one, with 900 papers,
each of which randomly references a mean of 20 papers in the same group, and a
smaller one, with 100 papers, each of which references papers in the same
group, with a mean of 50 citations, and papers in the larger one, with a mean
of 20 citations (see the first two plots in Figure 7). In this case, the
heuristics (2), based on the bare number of citations, recognizes two groups of
papers, the most important being the smaller one (the green bars in the fourth
plot of Figure 7). The correct distribution, however, is that depicted in the
third plot of Figure 7, given by (7), where the green bars are the leftmost
(i.e., the least important) ones. This behaviour is qualitatively better
reproduced in the last plot of Figure 7, concerning the heuristics (9).

For this problem, the PaperRanks obtained from (7) and (6) turn out to be
similar to each other, since the citation matrix turns out to be irreducible.
Consequently, we do not report the latter one.



Example 6 



The last example concerns the case of three groups of papers:

• a group of 200 leader papers, which randomly reference a mean of 20 papers
in the same group;

• a group of 200 papers, which randomly reference a mean of 20 leader papers
and 20 papers in their own group;

• a group of 400 papers, which randomly reference a mean of 20 leader papers
and 100 papers in their own group.

This situation is summarized by the first two plots in Figure 8. It is evident
that the correct ranking is that depicted in the third plot of Figure 8,
representing the vector in (7), with the leader papers (in red) more important
than those in the second group (in green) and those in the third group (in
blue), the latter two having the same importance. This situation is
qualitatively well reproduced by the new heuristics (9), as shown in the last
plot of Figure 8, where the leader papers (red) are again the most important
and the other ones (green and blue) have comparable importance, though the blue
ones are slightly overestimated. Conversely, the usual ranking (2), based on the
bare number of citations (shown in the fourth plot of Figure 8), depicts
a wrong scenario, in which the leader papers are overtaken by those with the
highest number of internal references (third group).
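The qualitative behaviour of the two heuristics in this three-group setting can be reproduced with synthetic data. The sketch below is not the authors' test code and does not compute the reference PaperRank: it assumes independent random citations with the stated means, the default self-reference, and an arbitrary seed, and it only compares group averages of bare and normalized citations.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [200, 200, 400]                # leaders, second group, third group
# mean_refs[g][h]: mean number of papers of group h referenced by a paper of group g
mean_refs = [[20, 0, 0],
             [20, 20, 0],
             [20, 0, 100]]

starts = np.concatenate(([0], np.cumsum(sizes)))
N = int(starts[-1])
L = np.zeros((N, N))                   # L[i, j] = 1 if paper j cites paper i
for g in range(3):                     # citing group (columns)
    for h in range(3):                 # cited group (rows)
        prob = mean_refs[g][h] / sizes[h]
        block = rng.random((sizes[h], sizes[g])) < prob
        L[starts[h]:starts[h + 1], starts[g]:starts[g + 1]] = block
np.fill_diagonal(L, 1.0)               # default self-reference

f = L.sum(axis=0)                      # references per paper
c = L.sum(axis=1)                      # bare citations
c_norm = L @ (1.0 / f)                 # normalized citations

def group_means(x):
    return [x[starts[k]:starts[k + 1]].mean() for k in range(3)]

bare = group_means(c)
norm = group_means(c_norm)
print(bare, norm)
```

Under these assumptions the bare counts put the third group first (roughly 100 citations per paper against roughly 80 for the leaders), whereas the normalized counts put the leaders first, with the third group somewhat above the second one.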

For this problem, the PaperRanks obtained from (7) and (6) turn out to be
similar to each other, since the citation matrix turns out to be irreducible.
Consequently, we do not report the latter one.



4 Conclusions



In this paper, we provide a mathematically correct definition of the
PaperRank problem for assessing scientific papers, which is able to properly
compare papers in disjoint groups as well, thus improving on a recent model
proposed in [3]. On the basis of this new model, we provide a local heuristics,
based on normalized citations, which appears to be quite effective (though much
cheaper to compute), making it possible to overcome some well-known drawbacks
of the ranking based on the bare number of citations.

Fig. 1. Citation matrix (1), number of references (3), and rankings (7), (2), and (9)
for Example 1, respectively.

Fig. 3. Citation matrix (1), number of references (3), and rankings (7), (2), and (9)
for Example 3, respectively.